Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction

Internal identifier: 000312 (Main/Exploration); previous: 000311; next: 000313

Authors: Roger Sayle [United Kingdom]; Paul Hongxing Xie [Sweden]; Sorel Muresan [Sweden]

Source :

Journal of chemical information and modeling [ 1549-9596 ] ; 2012.

RBID : Pascal:12-0102438

French descriptors

Pascal (Inist)
- Fouille donnée, Texte, Bioinformatique, Ontologie, Reconnaissance caractère, Reconnaissance optique caractère, Rupture, Brevet, Propriété industrielle, Dictionnaire automatique, Industrie pharmaceutique, Nomenclature, Gène, Homme, Typographie, Erreur humaine, Correction automatique, Trait union.
Wicri :
- topic : Brevet, Propriété industrielle, Industrie pharmaceutique, Nomenclature, Homme.

English descriptors

KwdEn :
- Automatic correction, Automatic dictionary, Bioinformatics, Character recognition, Data mining, Gene, Human, Human error, Hyphen, Nomenclature, Ontology, Optical character recognition, Patent rights, Patents, Pharmaceutical industry, Rupture, Text, Typography.

Abstract

The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields, such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.
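
As a rough illustration of the problem the abstract describes (recovering a chemical name from text that contains an OCR or spelling error), the sketch below matches a token against a tiny fixed word list using Levenshtein distance. It is not the CaffeineFix algorithm: the abstract does not give implementation details, and a fixed list cannot cover the unbounded set of systematic names the paper targets. The word list `VOCAB`, the function names, and the `max_errors` threshold are all assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one table row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical vocabulary; a real system would need far more than a list.
VOCAB = ["caffeine", "benzene", "ethanol", "pyridine"]


def correct(token: str, max_errors: int = 1):
    """Return the closest vocabulary entry within max_errors edits, else None."""
    lowered = token.lower()
    best = min(VOCAB, key=lambda word: edit_distance(lowered, word))
    return best if edit_distance(lowered, best) <= max_errors else None


if __name__ == "__main__":
    # "1" standing in for "i" is a typical OCR confusion.
    print(correct("caffe1ne"))
    print(correct("patent"))
```

A corrected name produced this way could then be passed to name-to-structure software, which is the role the abstract assigns to the preprocessing pass.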

Affiliations:

United Kingdom, Sweden

Links to previous steps (curation, corpus...)

to stream PascalFrancis, to step Corpus: 000102
to stream PascalFrancis, to step Curation: 000670
to stream PascalFrancis, to step Checkpoint: 000074
to stream Main, to step Merge: 000315
to stream Main, to step Curation: 000312

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction</title>
<author><name sortKey="Sayle, Roger" sort="Sayle, Roger" uniqKey="Sayle R" first="Roger" last="Sayle">Roger Sayle</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>NextMove Software</s1>
<s2>Cambridge</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>NextMove Software</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Hongxing Xie, Paul" sort="Hongxing Xie, Paul" uniqKey="Hongxing Xie P" first="Paul" last="Hongxing Xie">Paul Hongxing Xie</name>
<affiliation wicri:level="1"><inist:fA14 i1="02"><s1>Discovery Sciences, Computational Sciences, AstraZeneca R&amp;D Mölndal</s1>
<s2>431 83 Mölndal</s2>
<s3>SWE</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Suède</country>
<wicri:noRegion>431 83 Mölndal</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Muresan, Sorel" sort="Muresan, Sorel" uniqKey="Muresan S" first="Sorel" last="Muresan">Sorel Muresan</name>
<affiliation wicri:level="1"><inist:fA14 i1="02"><s1>Discovery Sciences, Computational Sciences, AstraZeneca R&amp;D Mölndal</s1>
<s2>431 83 Mölndal</s2>
<s3>SWE</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Suède</country>
<wicri:noRegion>431 83 Mölndal</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">12-0102438</idno>
<date when="2012">2012</date>
<idno type="stanalyst">PASCAL 12-0102438 INIST</idno>
<idno type="RBID">Pascal:12-0102438</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000102</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000670</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000074</idno>
<idno type="wicri:doubleKey">1549-9596:2012:Sayle R:improved:chemical:text</idno>
<idno type="wicri:Area/Main/Merge">000315</idno>
<idno type="wicri:Area/Main/Curation">000312</idno>
<idno type="wicri:Area/Main/Exploration">000312</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction</title>
<author><name sortKey="Sayle, Roger" sort="Sayle, Roger" uniqKey="Sayle R" first="Roger" last="Sayle">Roger Sayle</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>NextMove Software</s1>
<s2>Cambridge</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<wicri:noRegion>NextMove Software</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Hongxing Xie, Paul" sort="Hongxing Xie, Paul" uniqKey="Hongxing Xie P" first="Paul" last="Hongxing Xie">Paul Hongxing Xie</name>
<affiliation wicri:level="1"><inist:fA14 i1="02"><s1>Discovery Sciences, Computational Sciences, AstraZeneca R&amp;D Mölndal</s1>
<s2>431 83 Mölndal</s2>
<s3>SWE</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Suède</country>
<wicri:noRegion>431 83 Mölndal</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Muresan, Sorel" sort="Muresan, Sorel" uniqKey="Muresan S" first="Sorel" last="Muresan">Sorel Muresan</name>
<affiliation wicri:level="1"><inist:fA14 i1="02"><s1>Discovery Sciences, Computational Sciences, AstraZeneca R&amp;D Mölndal</s1>
<s2>431 83 Mölndal</s2>
<s3>SWE</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Suède</country>
<wicri:noRegion>431 83 Mölndal</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Journal of chemical information and modeling</title>
<title level="j" type="abbreviated">J. chem. inf. model. </title>
<idno type="ISSN">1549-9596</idno>
<imprint><date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Journal of chemical information and modeling</title>
<title level="j" type="abbreviated">J. chem. inf. model. </title>
<idno type="ISSN">1549-9596</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Automatic correction</term>
<term>Automatic dictionary</term>
<term>Bioinformatics</term>
<term>Character recognition</term>
<term>Data mining</term>
<term>Gene</term>
<term>Human</term>
<term>Human error</term>
<term>Hyphen</term>
<term>Nomenclature</term>
<term>Ontology</term>
<term>Optical character recognition</term>
<term>Patent rights</term>
<term>Patents</term>
<term>Pharmaceutical industry</term>
<term>Rupture</term>
<term>Text</term>
<term>Typography</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Fouille donnée</term>
<term>Texte</term>
<term>Bioinformatique</term>
<term>Ontologie</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Rupture</term>
<term>Brevet</term>
<term>Propriété industrielle</term>
<term>Dictionnaire automatique</term>
<term>Industrie pharmaceutique</term>
<term>Nomenclature</term>
<term>Gène</term>
<term>Homme</term>
<term>Typographie</term>
<term>Erreur humaine</term>
<term>Correction automatique</term>
<term>Trait union</term>
<term>.</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Brevet</term>
<term>Propriété industrielle</term>
<term>Industrie pharmaceutique</term>
<term>Nomenclature</term>
<term>Homme</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">The text mining of patents of pharmaceutical interest poses a number of unique challenges not encountered in other fields of text mining. Unlike fields, such as bioinformatics, where the number of terms of interest is enumerable and essentially static, systematic chemical nomenclature can describe an infinite number of molecules. Hence, the dictionary- and ontology-based techniques that are commonly used for gene names, diseases, species, etc., have limited utility when searching for novel therapeutic compounds in patents. Additionally, the length and the composition of IUPAC-like names make them more susceptible to typographic problems: OCR failures, human spelling errors, and hyphenation and line breaking issues. This work describes a novel technique, called CaffeineFix, designed to efficiently identify chemical names in free text, even in the presence of typographical errors. Corrected chemical names are generated as input for name-to-structure software. This forms a preprocessing pass, independent of the name-to-structure software used, and is shown to greatly improve the results of chemical text mining in our study.</div>
</front>
</TEI>
<affiliations><list><country><li>Royaume-Uni</li>
<li>Suède</li>
</country>
</list>
<tree><country name="Royaume-Uni"><noRegion><name sortKey="Sayle, Roger" sort="Sayle, Roger" uniqKey="Sayle R" first="Roger" last="Sayle">Roger Sayle</name>
</noRegion>
</country>
<country name="Suède"><noRegion><name sortKey="Hongxing Xie, Paul" sort="Hongxing Xie, Paul" uniqKey="Hongxing Xie P" first="Paul" last="Hongxing Xie">Paul Hongxing Xie</name>
</noRegion>
<name sortKey="Muresan, Sorel" sort="Muresan, Sorel" uniqKey="Muresan S" first="Sorel" last="Muresan">Sorel Muresan</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000312 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000312 | SxmlIndent | more
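
Outside Dilib, the identifiers in the XML record can also be pulled out with a few lines of Python. The sketch below is only an illustration: it uses a regular expression rather than an XML parser because the record, as shown on this page, uses the `wicri:` and `inist:` prefixes without namespace declarations, which a strict parser may reject. The `RECORD` string is a fragment copied from the record's `<publicationStmt>`; the name `idno_map` is hypothetical and not part of Dilib.

```python
import re

# Fragment copied from the <publicationStmt> element of the record.
RECORD = """<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">12-0102438</idno>
<date when="2012">2012</date>
<idno type="RBID">Pascal:12-0102438</idno>
<idno type="wicri:Area/Main/Exploration">000312</idno>
</publicationStmt>"""

# Matches <idno type="...">value</idno> and captures the type and the value.
IDNO = re.compile(r'<idno type="([^"]+)">([^<]*)</idno>')


def idno_map(xml_text: str) -> dict:
    """Map each idno type attribute to its text content."""
    return {kind: value for kind, value in IDNO.findall(xml_text)}


if __name__ == "__main__":
    for kind, value in idno_map(RECORD).items():
        print(f"{kind}: {value}")
```

The regular expression only handles the flat `<idno>` elements seen in this record; nested or multi-line elements would need a real parser and a namespace-aware copy of the XML.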

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:12-0102438
   |texte=   Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	OCR exploration server
	Warning: this site is under development! It is generated automatically from raw corpora, so the information has not been validated.
